Data Collection

2009

Data Exploration and Visualization

Feature Engineering

The new columns "DST_diff_1" to "DST_diff_24" hold, for each hour of the day, the difference between the current DST value and the DST value at the same hour on the previous day. Here DST is the Disturbance Storm Time index, a measure of geomagnetic activity, not Daylight Saving Time. For example, "DST_diff_1" is the DST value at hour 1 minus the DST value at hour 1 on the previous day, "DST_diff_2" is the same difference for hour 2, and so on. These differences capture how DST changes between consecutive days at each hour and highlight shifts or deviations from the previous day's values.
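As a minimal sketch of the difference columns described above, assume a wide table with one row per day and one column per hour. The input column names "DST_1" to "DST_24" and the random values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: one row per day, one column per hour (DST_1 ... DST_24).
rng = np.random.default_rng(0)
days = pd.date_range("2009-01-01", periods=5, freq="D")
hour_cols = [f"DST_{h}" for h in range(1, 25)]
daily = pd.DataFrame(rng.integers(-40, 10, size=(5, 24)), index=days, columns=hour_cols)

# Row-wise diff: each value minus the value in the same hour column one day earlier.
# The first day has no previous day, so its differences are NaN.
diffs = daily[hour_cols].diff()
diffs.columns = [f"DST_diff_{h}" for h in range(1, 25)]
daily = daily.join(diffs)
```

The first row of every "DST_diff_*" column is NaN and usually has to be dropped or imputed before modelling.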

Lagged Features:

  • Lag 24 hours: The DST value from the previous day at the same hour. It captures daily patterns or dependence on the previous day's values.
  • Lag 7 days: The DST value from the same day of the week, one week earlier. It captures weekly patterns or dependence on the previous week's values.
  • Lag 30 days: The DST value from 30 days earlier, which is approximately the same day of the previous month. It captures roughly monthly patterns or dependence on the previous month's values.
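The lags above could be built along these lines, assuming a gap-free hourly series so that shifting by rows is equivalent to shifting by time. The column names and the synthetic ramp are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series; the ramp 0, 1, 2, ... makes the lags easy to check by eye.
idx = pd.date_range("2009-01-01", periods=24 * 40, freq="h")
hourly = pd.DataFrame({"DST": np.arange(len(idx), dtype=float)}, index=idx)

# Shifting by N rows equals an N-hour lag only if no hours are missing.
hourly["DST_lag_24h"] = hourly["DST"].shift(24)
hourly["DST_lag_7d"] = hourly["DST"].shift(24 * 7)
hourly["DST_lag_30d"] = hourly["DST"].shift(24 * 30)
```

Each lag leaves NaN at the start of the series (24, 168, and 720 rows respectively), so the 30-day lag costs the most usable history.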

Rolling Statistics:

  • Rolling mean: The average DST value over the previous 24 hours. It smooths out short-term fluctuations and highlights longer-term trends.
  • Rolling standard deviation: The variability or volatility of the DST values over the previous 24 hours. It indicates how stable or variable the data has been.
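One possible way to compute the rolling statistics is sketched below on a synthetic hourly series. Shifting by one row before rolling is an assumption made here so that the window covers only the previous 24 hours and excludes the current value.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series (ramp 0..71) standing in for the real DST values.
idx = pd.date_range("2009-01-01", periods=72, freq="h")
hourly = pd.DataFrame({"DST": np.arange(72, dtype=float)}, index=idx)

# shift(1) drops the current hour from the window, which avoids leaking the
# value being predicted into its own feature.
previous = hourly["DST"].shift(1)
hourly["DST_roll_mean_24h"] = previous.rolling(window=24).mean()
hourly["DST_roll_std_24h"] = previous.rolling(window=24).std()
```

The first 24 rows are NaN because a full window of previous values is not yet available.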

Time-based Features:

  • Day of the week: The day of the week (0 for Monday, 6 for Sunday). It captures weekly patterns or dependence on specific weekdays.
  • Month: The month of the year (1 for January, 12 for December). It captures seasonal patterns or dependence on specific months.
  • Season: The quarter of the year (1-4). It gives a higher-level view of seasonal patterns.
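With a datetime index, these calendar features can be read directly from the index. The sketch below uses a handful of hand-picked dates, and the column names are assumptions.

```python
import pandas as pd

# A few illustrative timestamps; 2009-01-04 is a Sunday and 2009-01-05 a Monday.
ts = pd.DataFrame(
    index=pd.to_datetime(["2009-01-04", "2009-01-05", "2009-07-15", "2009-12-31"])
)

ts["day_of_week"] = ts.index.dayofweek  # 0 = Monday ... 6 = Sunday
ts["month"] = ts.index.month            # 1 = January ... 12 = December
ts["season"] = ts.index.quarter         # calendar quarter, 1-4
```

Note that "season" here is the calendar quarter, so December falls in quarter 4 rather than being grouped with January and February.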

Fourier Transform: The Fourier component features are the magnitudes of selected Fourier components of the DST series. The first and second components are used to capture the dominant periodicities in the data. A higher magnitude indicates a stronger periodic pattern at that frequency.
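To illustrate what such magnitudes look like, the sketch below runs a real FFT on a synthetic signal with a daily and a weaker weekly cycle. The signal, its amplitudes, and the choice of the two largest components are assumptions, not the project's actual data or selection rule.

```python
import numpy as np

# Synthetic hourly signal over 28 days: a 24-hour cycle (amplitude 10),
# a 168-hour cycle (amplitude 4), and a constant offset.
n_hours = 24 * 28
t = np.arange(n_hours)
signal = 10 * np.sin(2 * np.pi * t / 24) + 4 * np.sin(2 * np.pi * t / 168) - 15

# Remove the mean so the zero-frequency term does not dominate, then take the real FFT.
spectrum = np.fft.rfft(signal - signal.mean())
magnitudes = np.abs(spectrum) / n_hours
freqs = np.fft.rfftfreq(n_hours, d=1.0)  # cycles per hour

# Indices of the two strongest components, strongest first.
top = np.argsort(magnitudes)[::-1][:2]
fourier_1, fourier_2 = magnitudes[top]
periods_hours = 1.0 / freqs[top]
```

With this normalization a pure sinusoid of amplitude A shows up with magnitude A/2, so the daily cycle dominates the weekly one here.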

Interactions:

  • Hourly difference: The difference between the maximum and minimum DST values within each hour. It captures the range of values within each hour.
  • Weekday difference: The difference between the mean DST value on weekdays and the mean DST value on weekends. It captures differences in DST patterns between weekdays and weekends.
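The description of these two features leaves some room for interpretation. The sketch below takes one reading: "within each hour" is treated as grouping by hour of the day across all days, and the weekday difference is a single scalar broadcast to every row. Both choices, along with the column names and random data, are assumptions.

```python
import numpy as np
import pandas as pd

# Two full weeks of hypothetical hourly data, starting on a Monday.
idx = pd.date_range("2009-01-05", periods=24 * 14, freq="h")
rng = np.random.default_rng(1)
hourly = pd.DataFrame({"DST": rng.normal(-10, 5, len(idx))}, index=idx)

# Range (max - min) of DST for each hour of the day, mapped back onto every row.
by_hour = hourly.groupby(hourly.index.hour)["DST"]
hourly["hourly_diff"] = by_hour.transform("max") - by_hour.transform("min")

# Mean DST on weekdays minus mean DST on weekends (Saturday = 5, Sunday = 6).
is_weekend = hourly.index.dayofweek >= 5
weekday_diff = hourly.loc[~is_weekend, "DST"].mean() - hourly.loc[is_weekend, "DST"].mean()
hourly["weekday_diff"] = weekday_diff
```

Because both features are computed over the whole series, they should be derived from the training split only when used for forecasting.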

Moving Averages: The moving average feature is the average DST value over the previous 7 days. It smooths out fluctuations shorter than a week and highlights general patterns or changes from one week to the next.
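A 7-day moving average over hourly data could be sketched with a time-based window, as below. The use of closed="left" (to exclude the current hour) and min_periods (to require a full week of data) are choices made for this example, and the series is synthetic.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series (ramp 0, 1, 2, ...) covering ten days.
idx = pd.date_range("2009-01-01", periods=24 * 10, freq="h")
hourly = pd.DataFrame({"DST": np.arange(len(idx), dtype=float)}, index=idx)

# "7D" is a time-based window; closed="left" keeps only values strictly before
# the current timestamp, and min_periods leaves the first week as NaN.
hourly["DST_ma_7d"] = (
    hourly["DST"].rolling("7D", closed="left", min_periods=24 * 7).mean()
)
```

A time-based window stays correct when some hours are missing from the index, whereas a fixed row count would silently stretch over more than 7 days.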

ML models

CME

The classification report supports the following conclusions:

  1. Accuracy: The overall accuracy of the model is 0.4, indicating that it correctly predicts the CME category for 40% of the instances in the test set.

  2. Precision: Precision is the proportion of instances predicted as a given class that actually belong to that class. Here the precision is low for most classes, ranging from 0.00 to 0.44. The model struggles to predict the minority classes (e.g., fan-like, narrow, ragged) accurately, while achieving higher precision for the diffuse and jet-like classes.

  3. Recall: The recall score measures the proportion of actual instances of each class that were correctly predicted. The recall is highest for the diffuse and jet-like classes, indicating that the model performs relatively better at capturing these instances compared to other classes.

  4. F1-score: The F1-score is the harmonic mean of precision and recall and provides a balanced measure of the model's performance. The F1-scores are generally low for most classes, ranging from 0.00 to 0.67.

Overall, the results indicate that the model struggles to accurately classify the CME categories, especially for the minority classes. It may be beneficial to explore other classification algorithms, adjust the model parameters, or gather more data to improve the model's performance. Additionally, considering the imbalanced nature of the classes (e.g., few instances for fan-like, narrow, and ragged), addressing the class imbalance issue could potentially lead to better results.
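To make the metrics above concrete, the sketch below computes per-class precision, recall, and F1 by hand on a small made-up set of labels, and derives inverse-frequency class weights as one common way to address class imbalance. The labels and predictions are invented for illustration and are not the project's results.

```python
from collections import Counter

# Toy labels and predictions (hypothetical, not taken from the actual test set).
y_true = ["jet-like"] * 4 + ["diffuse"] * 3 + ["narrow"] * 2 + ["fan-like"]
y_pred = ["jet-like", "jet-like", "jet-like", "diffuse",
          "diffuse", "diffuse", "jet-like",
          "jet-like", "diffuse",
          "jet-like"]


def per_class_metrics(y_true, y_pred):
    """Return precision, recall, and F1 for every class seen in either list."""
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics = per_class_metrics(y_true, y_pred)

# Inverse-frequency weights: n_samples / (n_classes * class_count).
# This is the same heuristic scikit-learn applies for class_weight="balanced".
counts = Counter(y_true)
class_weights = {c: len(y_true) / (len(counts) * n) for c, n in counts.items()}
```

In this toy example the classes that are never predicted (narrow, fan-like) get precision, recall, and F1 of 0.00, which mirrors the pattern reported for the minority classes, and they receive the largest class weights.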

The per-category DST statistics support the following conclusions:

  1. Mean DST for each CME category:

    • The CME category 'slow' has the highest mean DST value of -2.04, indicating a less severe geomagnetic disturbance.
    • The CME category 'narrow' has the lowest mean DST value of -4.656, suggesting a more intense geomagnetic disturbance.
  2. Median DST for each CME category:

    • The CME category 'slow' has the highest median DST value of -2.04, indicating a less severe geomagnetic disturbance.
    • The CME category 'narrow' has the lowest median DST value of -7.64, suggesting a more intense geomagnetic disturbance.
  3. Frequency of each CME category:

    • The CME category 'jet-like' is the most frequent category with 44 occurrences, followed by 'faint' with 17 occurrences.
    • The least frequent categories are 'slow' and 'loop-like' with only 1 occurrence each.

These conclusions suggest that certain CME categories may be associated with specific levels of geomagnetic disturbance (as indicated by DST values, where more negative means more intense) and that they occur with different frequencies. For example, 'narrow' CMEs tend to be associated with more intense disturbances, while the 'slow' category appears less severe, although with only 1 occurrence that comparison rests on a single event. The frequency of CME categories also varies, with 'jet-like' being the most common.
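Summary statistics of this kind can be produced with a single group-by, as in the sketch below. The table, its column names ("cme_category", "DST"), and the values are made up for illustration.

```python
import pandas as pd

# Hypothetical event table: one row per CME with its category and a DST value.
events = pd.DataFrame({
    "cme_category": ["jet-like", "jet-like", "jet-like", "faint", "faint", "slow"],
    "DST": [-3.0, -5.0, -10.0, -1.0, -4.0, -2.0],
})

# Mean, median, and count per category, sorted from most to least negative mean.
summary = (
    events.groupby("cme_category")["DST"]
    .agg(["mean", "median", "count"])
    .sort_values("mean")
)
```

Keeping the count next to the mean and median makes it easy to spot categories, such as one with a single event, whose statistics should not be over-interpreted.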

2011

2022-2023